Optical flow estimation method

ABSTRACT

The optical flow in an image is estimated using only parameters extracted directly from a compressed video data stream, substantially eliminating the need to decode the video data stream for this purpose. Coefficients extracted from the compressed video data stream are used to establish a confidence map indicative of the edge strength within the image data and hence the accuracy of the associated motion field. A smooth motion field is generated from the motion vectors inherent in the compressed video data stream. The motion field is then used to update the confidence map between frames, providing an estimate of the optical flow.

This invention relates to optical flow estimation and, in particular, to optical flow estimation in a compressed video data stream.

BACKGROUND OF THE INVENTION

One of the problems of image processing lies in distinguishing foreground objects from background images in video data. Applications in areas as diverse as video processing, video compression and machine vision rely on effective segmentation techniques to perform their desired tasks. Motion segmentation exploits the temporal correlation of consecutive video images and detects image regions with different motion. This two-dimensional motion, usually called apparent motion or optical flow, needs to be recovered from image intensity and colour information in a video sequence.

In general, depending on the target application, one can trade optimisation performance (accuracy) against computational load (efficiency). Some specific applications need very high efficiency due to real-time requirements and practical feasibility. Surveillance applications, such as a pedestrian detection system for underground train stations, are an example of a situation in which a controlled environment (fixed camera, controlled illumination) allied with cost requirements (large numbers of cameras and the necessity of fast response times) is a good target for high-efficiency algorithms. Such a system is likely to use one of the popular available video encoding standards that already use some form of motion estimation designed for compression purposes.

Horn and Schunck, “Determining Optical Flow”, AI Memo 572, Massachusetts Institute of Technology, 1980, defines optical flow as “the distribution of apparent velocities of movement of brightness patterns in an image”. This definition assumes that all changes in the image are caused by the translation of these brightness patterns, leading to the gradient constraint equation, which involves spatial and temporal gradients and an optical flow velocity.

This velocity is a two-dimensional approximation of the real scene movement in the image plane that may be termed real velocity. The gradient constraint equation requires additional constraints for resolution. Horn and Schunck (above) use a global smoothness term to solve this problem, while Lucas and Kanade (“An Iterative Image Registration Technique with an Application to Stereo Vision”, Proc. of the Image Understanding Workshop, 1981, pp. 121-130) use a weighted least-squares fit of local first-order constraints, assuming that the image gradient is almost constant over local neighbourhoods. The Lucas Kanade method generates matrix eigenvalues, the magnitude of the eigenvalues being directly related to the strength of edges in the image and the eigenvalues being used to create a confidence map of optical flow accuracy.

A confidence map is a set of data which stores the confidence, or variance, at each pixel for the accuracy of the optical flow field.

Both of the above methods are called differential methods since they use the gradient constraint equation directly to estimate optical flow. The largest problem of such differential methods is that they cannot be applied to large motions because a good initial value is required.

U.S. Pat. No. 6,456,731 discloses an optical flow estimation method which incorporates a known hierarchically-structured Lucas Kanade method for interpolating optical flow between regions having different confidence values.

MPEG-2 video encoding allows high-quality video to be encoded, transmitted and stored; the compression is achieved by eliminating spatial and temporal redundancy that typically occurs in video streams.

In MPEG-2 encoding, the image is divided into 16×16 areas called macroblocks, and each macroblock is divided into four 8×8 luminance blocks and eight, four or two chrominance blocks according to the selected chroma format. A discrete cosine transform (DCT), an invertible discrete orthogonal transformation (see “Generic Coding of Moving Pictures and Associated Audio”, Recommendation H.262, ISO/IEC 13818-2, Committee Draft MPEG-2), is applied to each 8×8 luminance block, giving a matrix in which the high-frequency coefficients are mostly zero and only a small number of values are non-zero. The quantization step that follows effectively controls compression ratios by discarding more or less information according to the value of the quantization scale. Zig-zag and Huffman coding exploit the resulting high number of zero values and compress the image data.
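
By way of illustration only (this sketch is not part of the MPEG-2 standard text or of the system described later), the following Python fragment applies an 8×8 DCT to a smooth luminance block and a coarse quantisation step; the single quantisation divisor used here is a simplification, whereas MPEG-2 uses quantisation matrices together with a quantization scale:

```python
import numpy as np
from scipy.fft import dctn, idctn

# A smooth 8x8 luminance block (a simple intensity ramp) so that most of the
# DCT energy falls in the low-frequency coefficients.
block = np.add.outer(8.0 * np.arange(8), 4.0 * np.arange(8)) + 100.0

coeffs = dctn(block, type=2, norm='ortho')       # 8x8 DCT of the block
quantised = np.round(coeffs / 16.0)              # coarse uniform quantisation
restored = idctn(quantised * 16.0, type=2, norm='ortho')

print(np.count_nonzero(quantised), "of 64 coefficients survive quantisation")
print("max reconstruction error:", np.abs(restored - block).max())
```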

Temporal redundancy is quite severe in video since consecutive images are very similar. To achieve even better compression, each macroblock is compared not to its direct spatial equivalent in a previous image but to a translated version of it (to compensate for movement in the scene) that is found using a block-matching algorithm. The translation details are stored in a motion vector that refers to either a previous image or a following image depending on the picture type.
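
The block-matching search can be illustrated with a toy full-search sketch (illustrative only, not the search specified by any encoder; real encoders use faster search strategies and half-pixel refinement):

```python
import numpy as np

def match_block(ref, cur, top, left, search=8):
    """Return the (dy, dx) displacement of the best 16x16 match in `ref`
    for the macroblock of `cur` whose top-left corner is (top, left),
    minimising the sum of absolute differences over a small search window."""
    target = cur[top:top + 16, left:left + 16].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + 16 > ref.shape[0] or x + 16 > ref.shape[1]:
                continue
            sad = np.abs(ref[y:y + 16, x:x + 16].astype(np.int32) - target).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv
```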

MPEG-2 encoding defines three kinds of image data: intra-coded frame data (I pictures) with only spatial compression (no motion vectors), and predicted frame data (P pictures) and bi-directionally interpolated frame data (B pictures) with motion estimation.

I pictures only have intra-coded macroblocks (macroblocks without motion estimation) because they are coded without reference to other pictures. P and B pictures can also include inter-coded macroblocks (macroblocks where only the difference from the macroblock designated by the motion vector is encoded). P pictures are coded more efficiently using motion compensated prediction from a past I or P picture and are generally used as a reference for future prediction. B pictures provide the highest degree of compression but require both past and future reference pictures for motion compensation; they are never used as references for prediction.

FIG. 1 illustrates the typical picture sequence of an MPEG-2 compressed video data stream. The organisation of the three picture types, I, B and P pictures, in a video stream is flexible, the choice being left to the encoder and being dependent on the requirements of the application. In the particular example shown, the pictures are in the sequence IBBPBBP, and the arrows represent the direction in which pictures are estimated. I pictures are used to predict B and P pictures. P pictures are used to predict prior and following B pictures. B pictures are not used as references for other pictures.

U.S. Pat. No. 6,157,396 discloses a system for improving the quality of digital video using a multitude of techniques and focussing on MPEG-2 compressed video data. The system aims to enhance standard compressed MPEG-2 decoding by using a number of additional processes, including retaining groups of pictures (GOP) and the motion vector information, to aid post-decompression filtering in the image reconstruction (IR) and digital output processor (DOP). The system decompresses the image but retains the motion vectors for later use in the DOP. Supplemental information, such as a layered video stream, instructional cues and image key metadata, is used to enhance the quality of the decoded image through post-decompression filtering. However, this system relies on decompression of the MPEG-2 compressed video data, which is disadvantageous in that it tends to increase computational complexity and decrease processing speed.

It is an object of the present invention to provide fast, reasonably accurate two-dimensional motion estimation of a video scene for applications in which compressed digital video data is used and it is desired to avoid high computational costs.

It is a further object of the present invention to provide such motion estimation which closely approximates the Lucas-Kanade method of optical flow estimation while working only with compressed video data.

SUMMARY OF THE INVENTION

According to a first aspect of the present invention there is provided an optical flow estimation method comprising the steps of: obtaining encoded image data representative of an image sequence of a changing object having a motion field; extracting from said encoded image data first frame data blocks not incorporating motion vector encoding; extracting from said encoded image data second frame data blocks incorporating motion vector encoding; determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and updating the confidence map data on the basis of the smooth motion field data blocks to provide output data indicative of the optical flow of the image.

In one embodiment of the invention, the encoded image data is encoded in the MPEG-2 video data format. However, the invention is applicable to any compressed domain representation in which motion vectors are encoded.

According to a second aspect of the present invention there is provided an optical flow estimation system utilising encoded image data representative of an image sequence of a changing object having a motion field, the system comprising: first extraction means for extracting from said encoded image data first frame data blocks not incorporating motion vector encoding; second extraction means for extracting from said encoded image data second frame data blocks incorporating motion vector encoding; determination means for determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; derivation means for deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and updating means for updating said confidence map data on the basis of said smooth motion field data blocks to provide output data indicative of the optical flow of the image.

According to a third aspect of the present invention there is provided a computer readable recording medium on which is recorded an optical flow estimation program for causing a computer to execute the following steps: extracting, from encoded image data representative of an image sequence of a changing object having a motion field, first frame data blocks not incorporating motion vector encoding; extracting from said encoded image data second frame data blocks incorporating motion vector encoding; determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and updating the confidence map data on the basis of the smooth motion field data blocks to provide output data indicative of the optical flow of the image.

BRIEF DESCRIPTION OF THE DRAWINGS

For a better understanding of the present invention and in order to show how the same may be carried into effect, a preferred embodiment of an optical flow estimation method in accordance with the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 illustrates a typical image sequence of an MPEG-2 encoded video data stream;

FIG. 2 is a diagram providing a comparison of the Lucas Kanade method of optical flow estimation with that of a preferred embodiment of the present invention;

FIG. 3 illustrates the structure of the AC[1] and AC[8] coefficients within a DCT block of the encoded video data stream;

FIG. 4 shows a confidence map;

FIGS. 5a to 5d illustrate the steps for obtaining a smooth motion field in accordance with a preferred embodiment of the present invention;

FIG. 6 is a diagram illustrating the basis of a confidence map update step in such an embodiment;

FIG. 7 is a diagram illustrating a typical scenario in which the new confidence map is a weighted average of the four associated DCT blocks in the I image;

FIGS. 8a and 8b illustrate the effects of thresholding on a scene with no real motion;

FIG. 9 illustrates the smooth motion field of the scene of FIGS. 8a and 8b as provided by this embodiment;

FIGS. 10a and 10b show the effect of noise reduction on the confidence map of the scene of FIGS. 8a and 8b;

FIGS. 11a and 11b illustrate the effect of noise reduction on the confidence map of the scene of FIG. 5a;

FIGS. 12a and 12b show the motion field generated by the Lucas Kanade method and the preferred embodiment of the present invention respectively for the scene of FIG. 5a;

FIGS. 13a and 13b show the motion fields of FIGS. 12a and 12b without the original image;

FIGS. 14a and 14b show further aspects of the motion fields of FIGS. 12a and 12b without the original image;

FIGS. 15a and 15b illustrate the effects of blurring caused by three confidence update steps applied to the motion fields of FIGS. 14a and 14b respectively; and

FIG. 16 is a flow diagram illustrating the steps of the method of the preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE DRAWINGS

The Lucas Kanade method of optical flow estimation noted above involves direct processing of pixel information. The preferred embodiment of the present invention to be described below closely approximates the Lucas Kanade method, but works in the compressed video data domain, using only quantities related to the compressed video data.

A parallel between the Lucas Kanade method and that of the preferred embodiment is drawn in FIG. 2. Two tasks can be identified, namely the generation of a motion vector field with one vector per data macroblock and the estimation of a confidence map for the motion vector field. Not every data macroblock has one motion vector. Some data macroblocks have none (intra-coded blocks) and some have two (interpolated blocks in B pictures). Therefore a basis is needed upon which a smooth motion field based on MPEG-2 motion vectors can be generated.

The obtained motion vector field, like the optical flow field, needs some sort of confidence measure for each vector to be meaningful. Areas of the motion vector field with strong edges exhibit better correlation with real motion than textureless ones.

A DCT block is the set of 64 DCT coefficients which result from application of a DCT to an 8×8 block of a data macroblock. A DCT coefficient is the amplitude of a specific cosine basis function, while an AC coefficient is a DCT coefficient for which the frequency in one or both of the dimensions is non-zero. The DCT coefficients of an intra-coded data macroblock carry a measure of edge strength in the AC coefficients. The AC[1] and AC[8] coefficients of each 8×8 luminance block, illustrated in FIG. 3, measure the strength and direction of an edge within that block, and, drawing a parallel to the eigenvalues of the Lucas Kanade method, it is possible to use these two coefficients to obtain a set of confidence measures for the MPEG-2 compressed video data stream.

The AC[1] and AC[8] coefficients may be considered approximations to the average spatial gradients over a DCT data block. Let f(x,y) be the 8×8 image block and F(u,v) be the block's DCT. The image can be reconstructed using the inverse DCT:

$f(x,y) = \sum_{u,v} \frac{C_{u}}{2}\,\frac{C_{v}}{2}\, F(u,v)\,\cos\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right)$

This reconstruction can be used for continuous values of x and y, not just integers, and therefore the spatial gradient $f_{x}(x,y)$ can be obtained by differentiating:

$f_{x}(x,y) = -\sum_{u,v} \frac{u\pi}{8}\,\frac{C_{u}}{2}\,\frac{C_{v}}{2}\, F(u,v)\,\sin\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right)$

A similar expression for the spatial gradient $f_{y}(x,y)$ can also be obtained. An average gradient over the complete image block can then be calculated as a weighted sum:

$\overline{f_{x}} = -\sum_{x,y} w(x,y) \sum_{u,v} \frac{u\pi}{8}\,\frac{C_{u}}{2}\,\frac{C_{v}}{2}\, F(u,v)\,\sin\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right)$

where w(x,y) is some spatial weighting function. A popular weighting function to choose is a Gaussian. However, by choosing the weighting function:

$w(x,y) = \sin\!\left(\frac{(2x+1)\pi}{16}\right)$

for the x direction, the average gradient expression simplifies as follows:

$\overline{f_{x}} = -\sum_{x,y} \sin\!\left(\frac{(2x+1)\pi}{16}\right) \sum_{u,v} \frac{u\pi}{8}\,\frac{C_{u}}{2}\,\frac{C_{v}}{2}\, F(u,v)\,\sin\!\left(\frac{(2x+1)u\pi}{16}\right)\cos\!\left(\frac{(2y+1)v\pi}{16}\right)$

$= -\sum_{u,v} \frac{u\pi}{8}\,\frac{C_{u}}{2}\,\frac{C_{v}}{2}\, F(u,v) \sum_{x} \sin\!\left(\frac{(2x+1)\pi}{16}\right)\sin\!\left(\frac{(2x+1)u\pi}{16}\right) \sum_{y} \cos\!\left(\frac{(2y+1)v\pi}{16}\right)$

$= -\pi\, C_{1} C_{0}\, F(1,0) = -\frac{\pi}{\sqrt{2}}\, F(1,0)$

where $C_{u}$ and $C_{v}$ equal $1/\sqrt{2}$ when u or v is 0 and equal 1 otherwise, and $C_{0}$ and $C_{1}$ are the values of $C_{u}$ at u=0 and u=1 respectively.

Similar analysis can be performed for $\overline{f_{y}}$. Thus, with appropriate weighting, the AC[1] (F(1,0)) and AC[8] (F(0,1)) coefficients can be interpreted as negative weighted average gradients over the DCT data block.

To summarise the above equations, the coefficients AC[1] and AC[8] approximate the average spatial gradient within a block:

$AC[1] \propto -\overline{f_{x}}, \qquad AC[8] \propto -\overline{f_{y}}.$
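
As an illustrative check only (not part of the described method; it assumes a Python environment with numpy and scipy, and an array indexed block[y, x] so that F[0, 1] corresponds to F(1,0) above), the simplification can be verified numerically by brute force:

```python
import numpy as np
from scipy.fft import dctn

# An 8x8 test block with a constant horizontal intensity gradient, block[y, x].
block = np.add.outer(0.1 * np.arange(8), 0.3 * np.arange(8)) + 5.0

# Orthonormal DCT; with this scaling F[v, u] matches the F(u, v) of the
# equations above (u is the horizontal frequency index).
F = dctn(block, type=2, norm='ortho')

C = np.where(np.arange(8) == 0, 1.0 / np.sqrt(2.0), 1.0)

# Sin-weighted average of the horizontal gradient of the continuous
# DCT reconstruction, evaluated term by term.
fx_bar = 0.0
for yy in range(8):
    for xx in range(8):
        w = np.sin((2 * xx + 1) * np.pi / 16)
        grad = 0.0
        for vv in range(8):
            for uu in range(8):
                grad += (-uu * np.pi / 8) * (C[uu] / 2) * (C[vv] / 2) * F[vv, uu] \
                        * np.sin((2 * xx + 1) * uu * np.pi / 16) \
                        * np.cos((2 * yy + 1) * vv * np.pi / 16)
        fx_bar += w * grad

print(fx_bar, -(np.pi / np.sqrt(2.0)) * F[0, 1])   # the two values agree
```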

Instead of constructing a matrix for the gradients at each pixel and then averaging, as in the Lucas Kanade method, the preferred embodiment of the present invention uses the steps of averaging the gradients (i.e. using the DCT coefficients) and then constructing a single matrix M′:

$M^{\prime} = \begin{bmatrix} -AC[1] \\ -AC[8] \end{bmatrix}\begin{bmatrix} -AC[1] & -AC[8] \end{bmatrix}.$

This matrix will, by definition, be singular with only one significant eigenvalue:

$\lambda_{1}^{\prime} = AC[1]^{2} + AC[8]^{2}$

which gives a measure of the confidence of the motion vector in the direction of the eigenvector:

${\overline{e}}_{1}^{\prime} = \begin{bmatrix} -AC[1] \\ -AC[8] \end{bmatrix}.$

A strong eigenvalue signals a large image gradient, e.g. an edge, in this block, and the associated eigenvector will be normal to the edge's direction. Only the magnitude of $\lambda_{1}^{\prime}$ has been used as a confidence measure, but the difference between the direction of the motion vector and ${\overline{e}}_{1}^{\prime}$ could also be analysed.

M′ can be related to the corresponding Lucas Kanade matrix M to show that the resulting confidence estimates are inherently conservative. The details of this are given in Appendix A.

An MPEG-2 compressed video data stream has a number of DCT blocks per macroblock that varies with the chroma format encoding parameter. Only the luminance DCT blocks are used in this embodiment, because there are always four per data macroblock (unlike the chrominance blocks, of which there can be two, four or eight). Another reason is that the standard Lucas Kanade method only uses luminance values, so that, since the preferred embodiment attempts to approximate it, it should use only luminance as well. FIG. 4 shows an example of a confidence map, generated from the AC[1] and AC[8] coefficients, of the form AC[1]²+AC[8]², and its correlation with edge strength.
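
A minimal sketch of this confidence measure follows, assuming ac1 and ac8 are hypothetical arrays holding the AC[1] and AC[8] coefficients of each luminance DCT block of an I picture (one entry per block); it is illustrative rather than the implementation of the embodiment:

```python
import numpy as np

def confidence_map(ac1, ac8):
    """Per-DCT-block confidence: the single eigenvalue AC[1]^2 + AC[8]^2."""
    return ac1.astype(np.float64) ** 2 + ac8.astype(np.float64) ** 2

def edge_normals(ac1, ac8):
    """Unit eigenvectors -(AC[1], AC[8]), normal to the local edge direction."""
    vec = -np.stack([ac1, ac8], axis=-1).astype(np.float64)
    norm = np.linalg.norm(vec, axis=-1, keepdims=True)
    return np.divide(vec, norm, out=np.zeros_like(vec), where=norm > 0)
```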

As previously noted, not every data macroblock has one associated motion vector. In order to obtain a smooth motion field, a number of rules are implemented as follows:

1) Macroblocks with no motion vector have the same movement as in the previous image.

2) When a macroblock has two motion vectors, the one pointing back is reversed and added to the one pointing forward.

3) Motion vector magnitude is normalised (motion vectors in P pictures span three images but this does not happen with motion vectors in B pictures, so that scaling is required).

4) Skipped macroblocks in I pictures have no movement, while in P pictures they have movement similar to the previous block.

Applying these rules removes the dependency on specific MPEG-2 characteristics, such as picture and macroblock type, and creates a motion field with one vector per macroblock with standardised magnitude. A spatial median filter is then applied to remove isolated vectors that have a low probability of reflecting real movement in the image; a sketch of this procedure is given below. FIGS. 5a and 5b illustrate some stages of this process, FIG. 5a showing the original image, and FIG. 5b showing the raw vectors extracted from the data macroblocks of the compressed video data stream. FIG. 5c shows the motion field after application of the above rules, and FIG. 5d shows the smoothed motion field.
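
The following Python sketch applies the four rules and the median filter under assumed data structures; the macroblock representation, the single frame_distance scaling and the 3×3 filter size are assumptions of this sketch rather than details given in the description:

```python
import numpy as np
from scipy.ndimage import median_filter

def smooth_motion_field(macroblocks, pic_type, prev_field, frame_distance):
    """Build one normalised vector per macroblock from the decoded vectors.

    macroblocks[j][i] is a hypothetical dict with optional 'fwd'/'bwd' vectors
    (in pixels) and a 'skipped' flag; prev_field is the smooth field of the
    previous picture; frame_distance is the picture span of the vectors
    (e.g. 3 for P pictures in an IBBP pattern), used for rule 3.
    """
    h, w = prev_field.shape[:2]
    field = np.zeros((h, w, 2))
    for j in range(h):
        for i in range(w):
            mb = macroblocks[j][i]
            if mb.get('skipped'):
                # Rule 4: no movement in I pictures, previous block's movement
                # in P pictures.
                field[j, i] = 0.0 if pic_type == 'I' else field[j, max(i - 1, 0)]
            elif mb.get('fwd') is not None and mb.get('bwd') is not None:
                # Rule 2: reverse the backward vector and add it to the forward one.
                field[j, i] = np.asarray(mb['fwd']) - np.asarray(mb['bwd'])
            elif mb.get('fwd') is not None or mb.get('bwd') is not None:
                field[j, i] = np.asarray(mb['fwd'] if mb.get('fwd') is not None
                                         else mb['bwd'])
            else:
                # Rule 1: no motion vector: same movement as in the previous image.
                field[j, i] = prev_field[j, i]
    # Rule 3: normalise magnitudes to a per-picture displacement.
    field = field / float(frame_distance)
    # Suppress isolated vectors with a component-wise 3x3 spatial median filter.
    field[..., 0] = median_filter(field[..., 0], size=3)
    field[..., 1] = median_filter(field[..., 1], size=3)
    return field
```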

Since P and B pictures transmit mostly inter-coded blocks, the confidence map is only fully defined for I pictures, which typically only occur every 12 pictures. The typical picture sequence of FIG. 1 illustrates this. A confidence map update step is required, and the preferred embodiment exploits the fact that, even if the new exact confidence of a predicted data macroblock is unknown, the confidence of the original 16×16 area that the corresponding motion vector points at may be estimated. FIG. 6 illustrates how an inter-coded macroblock (in this case from a P picture) relates to a corresponding I picture confidence map. Since motion vectors in MPEG-2 compressed video data have half-pixel accuracy, they rarely point to the top-left corner of a DCT data block, so that interpolation is required. The typical scenario will be that the new confidence is a weighted average of four DCT blocks. FIG. 7 shows how DCT blocks may be used to estimate the confidence of an error block.

There are two obvious problems that can arise from such an approach, namely excessive averaging and error propagation. In fact, every confidence map update step involves interpolating new confidences, and the confidence map is only reset on I pictures that occur every 12 pictures in a typical IBBPBBPBBPBB sequence. In practice, this problem is not serious if adequate measures are taken. An obvious measure is that updates should be kept to a minimum in order to avoid excessive blurring of the confidence map. In fact, an update is only necessary when a vector is larger than half a DCT data block; if a vector is smaller than this, there is a low probability that the edge has moved significantly. Also, since motion vectors of a P picture only refer to the previous I or P picture, there is a maximum of three update steps, making error propagation less serious. Confidence map updating in B pictures may depend on the magnitude of motion present in the compressed video data sequence. A sketch of one update step is given below.
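
The sketch below illustrates one such update step; the array layout, the sign convention of the motion vectors and the handling of out-of-range positions are assumptions of this sketch, not details taken from the description:

```python
import numpy as np

def update_confidence(conf, field):
    """One confidence map update step (a sketch, not the patent's code).

    conf holds one confidence value per 8x8 DCT block of the reference (I)
    picture; field holds one (dx, dy) vector per 16x16 macroblock, in pixels,
    assumed to point from the current block to the area it was predicted from.
    """
    bh, bw = conf.shape
    new_conf = conf.copy()
    for by in range(bh):
        for bx in range(bw):
            dx, dy = field[by // 2, bx // 2]
            # Update only when the block has moved by more than half a DCT block.
            if max(abs(dx), abs(dy)) <= 4.0:
                continue
            # Source position in the reference confidence map, in block units.
            sx, sy = bx + dx / 8.0, by + dy / 8.0
            x0, y0 = int(np.floor(sx)), int(np.floor(sy))
            fx, fy = sx - x0, sy - y0
            acc, wsum = 0.0, 0.0
            # Bilinearly weighted average of the four neighbouring DCT blocks.
            for oy, wy in ((0, 1.0 - fy), (1, fy)):
                for ox, wx in ((0, 1.0 - fx), (1, fx)):
                    x, y = x0 + ox, y0 + oy
                    if 0 <= x < bw and 0 <= y < bh:
                        acc += wx * wy * conf[y, x]
                        wsum += wx * wy
            # Areas whose source falls outside the picture get zero confidence.
            new_conf[by, bx] = acc / wsum if wsum > 0 else 0.0
    return new_conf
```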

Areas of the image from which edges have moved are unreliable, so that either new edges move to these areas or they are marked with zero confidence. This happens when a moving object uncovers unknown background, possibly generating random motion that, since it has zero confidence, is ignored.

The method of the preferred embodiment will now be described with reference to FIG. 16 as a sequence of steps that may be carried out within a digital processor under the control of a computer program. Encoded image data representative of an image of a changing object, preferably in an MPEG-2 encoded format, is supplied to the digital processor from a video camera, for example. Step 1 corresponds to obtaining a frame header from the encoded image data, which, for an MPEG-2 data stream, will comprise information describing whether the frame is an I, B or P picture.

At step 2 a decision is made as to whether or not the encoded video data frame incorporates motion vector encoding. Pictures used exclusively for reference purposes do not contain motion vector encoding, whereas predicted pictures contain motion vector information describing how to obtain the predicted picture from a reference picture. When the encoded video data is MPEG-2 encoded data, I pictures have no motion vector encoding, whereas both B and P pictures do incorporate motion vector encoding. The decision is therefore made on the basis of information extracted from the frame header.

If the decision made at step 2 is that the encoded image data does not incorporate motion vector encoding then, at step 3, first frame data blocks not incorporating motion vector encoding are extracted. In the case of MPEG-2 encoded data, this step comprises extracting macroblocks of the I pictures, yielding DCT coefficients. The AC coefficients AC[1] and AC[8] are a subset of the DCT coefficients, and provide information on the strength and direction of edges in the real image.

At step 4 the encoded image data not containing motion vector encoding is used to generate a confidence map indicative of the edge strength within the image data and hence the accuracy of the motion field. The AC[1] and AC[8] coefficients are used to create the confidence map for MPEG-2 encoded data.

If the decision made at step 2 is that the encoded image data does incorporate motion vector encoding then, at step 5, second frame data blocks incorporating motion vector encoding are extracted. In the case of MPEG-2 encoded data, this step comprises extracting macroblocks of the B and P pictures.

At step 6, the second frame data blocks are used to derive smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised. This is achieved by application of a set of rules which compensate for not every macroblock having one motion vector or motion vectors of a normalised magnitude. As noted above, such rules remove the dependency on the specific format in which the image data is encoded.

The confidence map data determined in step 4 above is updated at step 7 on the basis of the smooth motion field data blocks derived in step 6. The motion vectors relate regions of the first frame data blocks to regions of the second frame data blocks. This provides a function taking the confidence map from the first frame onto the second frame. Because the motion vectors typically do not exactly map a DCT block in an I image to a DCT block in a P or B image, it is necessary to interpolate across the confidence map as depicted in FIG. 6. In one embodiment the new confidence map is a weighted average of the confidence values of the four neighbouring DCT blocks, as shown in FIG. 7.

Steps 4 and 7 thereby together provide an updated confidence map for the smoothed motion vectors, leading to an estimate of the high-confidence optical flow within the image, as performed in step 8.

To obtain a dense optical flow field, the high-confidence optical flow data can be spatially interpolated, as performed in step 9.
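
The overall per-frame loop of FIG. 16 might be organised as in the following high-level sketch; the stream-parsing callables and the dense interpolation of step 9 are hypothetical and supplied by the caller, while confidence_map, smooth_motion_field and update_confidence refer to the helper functions sketched earlier:

```python
def estimate_optical_flow(stream, parse_frame_header, extract_intra_blocks,
                          extract_inter_blocks, interpolate_dense,
                          confidence_map, smooth_motion_field, update_confidence):
    conf, field = None, None
    for frame in stream:
        header = parse_frame_header(frame)                     # step 1
        if not header.has_motion_vectors:                      # step 2: I picture
            ac1, ac8 = extract_intra_blocks(frame)             # step 3
            conf = confidence_map(ac1, ac8)                    # step 4
        else:                                                  # P or B picture
            macroblocks = extract_inter_blocks(frame)          # step 5
            field = smooth_motion_field(macroblocks,           # step 6
                                        header.pic_type, field,
                                        header.frame_distance)
            conf = update_confidence(conf, field)              # step 7
            # steps 8 and 9: high-confidence flow, then dense interpolation
            yield interpolate_dense(field, conf)
```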

Both the motion vector field and the associated confidence map are accordingly estimated. Magnitude and confidence map thresholding can be applied to remove the majority of the noisy vectors, removing glow effects of lights on shiny surfaces and shadows.
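
A minimal thresholding sketch is given below; the threshold values and the reduction of the per-DCT-block confidence to one value per macroblock (here the maximum of the four blocks) are assumptions of this sketch, not details from the description:

```python
import numpy as np

def threshold_flow(field, conf, mag_thresh=1.0, conf_thresh=1000.0):
    """Zero out macroblock vectors that are too small or too low-confidence."""
    h, w = field.shape[:2]
    mag_ok = np.linalg.norm(field, axis=-1) > mag_thresh
    # Reduce the 2x2 DCT-block confidences of each macroblock to one value.
    conf_mb = conf[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))
    keep = mag_ok & (conf_mb > conf_thresh)
    return field * keep[..., None]
```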

Such data is somewhat sensitive to a fixed threshold, and accordingly a motion segmentation algorithm using the motion estimation of the preferred embodiment could use more flexible decision methods (adaptive thresholding, multi-step thresholding, probabilistic models, etc.). For the above-mentioned pedestrian detection scenario a fixed threshold is sufficient, and accordingly the examples referred to below use this for simplicity of visualization of the results.

Using the scene of FIGS. 8a and 8b by way of a first example, the effect of using confidence thresholding to remove illumination noise and increase the accuracy of the motion estimation can be observed. In this specific scene, bright lights and a shiny pavement generate a lot of apparent motion that does not reflect real motion (FIG. 9). The majority of the illumination noise is removed except for the noise on the handrails, where there are strong edges and a lot of apparent motion. FIGS. 10a and 10b show the effect of noise reduction on the confidence map before and after thresholding respectively. Such noise can be dealt with by a subsequent motion segmentation algorithm. A second example is shown in FIGS. 11a and 11b, with the thresholded low-resolution motion field (16×16 blocks) in a scene with moving objects being shown before and after confidence thresholding in FIGS. 11a and 11b respectively.

Motion fields generated according to the Lucas Kanade method provide dense motion maps in that a motion vector is provided for every pixel. This can be seen in FIG. 12a, where pixels with vectors pointing right are shown in white and vectors pointing left are shown in grey for the sub-sampled Lucas Kanade motion field. FIG. 12b shows the same scene with the MPEG-2 smooth motion field. For ease of comparison, FIG. 13a shows this Lucas Kanade motion field subsampled to the same resolution as the proposed scheme, and FIG. 13b shows how successfully the smooth motion field of the present invention approximates it. FIGS. 14a and 14b show both motion fields corresponding to another situation (that depicted in FIG. 5a), these figures showing the subsampled Lucas Kanade and MPEG-2 smooth motion fields respectively. A close match may be observed between the Lucas Kanade motion field and that of the preferred embodiment of the present invention.

An example of the blurring effect caused by the confidence map update step is shown in FIGS. 15a and 15b. The most severe case is presented (after three update steps) to show that, even with some blurring, the approximation is still quite accurate. Some definition is lost in the shape of the object, but most of the motion is still correct. FIG. 15a shows the MPEG-2 smooth motion field, and FIG. 15b the Lucas Kanade motion field.

The approximation to the Lucas Kanade method of optical flow estimation provided by the preferred embodiment of the present invention is obtained at very low computational cost. All of the operations are simple and are applied to a small data set (44×30 macroblocks, 44×30×4 DCT blocks). This, allied with minimal MPEG-2 decoding, easily allows frame rates of over 25 images per second with unoptimised code. A rough comparison with Lucas Kanade method code shows that the algorithm of the preferred embodiment is approximately 50 times faster. With appropriate optimisation, faster performance is possible. Given its low complexity, the method of the present invention may be implemented in a low-cost real-time optical flow estimator based around an inexpensive digital signal processing chipset.

Although the smoothing effect of the weighted averaging is small, it is possible that it may be reduced even further. Object segmentation is likely to improve it, reducing blurring significantly and allowing more robust motion estimation. A more specific improvement would be to use low-resolution DC images for static camera applications, where a background subtraction technique, allied with the motion estimation of the present invention, might be used for robust foreground object detection and tracking.

The present invention provides a method for estimating optical flow, particularly optical flow as estimated by the well-known Lucas Kanade method, using only compressed video data. It will be appreciated by the person skilled in the art that various modifications may be made to the above described embodiments without departing from the scope of the present invention.

Appendix A

The Lucas Kanade and present invention matrices can be approximately characterized as follows:

$\overline{M} \approx E\left\{ \begin{bmatrix} f_{x} \\ f_{y} \end{bmatrix}\begin{bmatrix} f_{x} & f_{y} \end{bmatrix} \right\}$ and $M^{\prime} \approx \begin{bmatrix} E\{f_{x}\} \\ E\{f_{y}\} \end{bmatrix}\begin{bmatrix} E\{f_{x}\} & E\{f_{y}\} \end{bmatrix}$

respectively, where E{·} represents expectation over the image pixels and the scale difference in M′ has been ignored. It then follows from Jensen's inequality (see Cover and Thomas, “Elements of Information Theory”, 1991, Wiley & Sons) that for any direction w:

$w^{T}M^{\prime}w \leq w^{T}\overline{M}w$

where $(w^{T}\overline{M}w)^{-1}$ represents the variance of the optical flow estimate in the direction w for the Lucas Kanade method. Thus the singular matrix M′ always gives larger variance estimates, so that it is an inherently conservative confidence measure.
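
The single step behind this inequality can be made explicit. Writing $g = \begin{bmatrix} f_{x} & f_{y} \end{bmatrix}^{T}$ for the gradient vector (a notation used only here for brevity), the two quadratic forms become

$w^{T}M^{\prime}w = \left(E\{w^{T}g\}\right)^{2} \qquad \text{and} \qquad w^{T}\overline{M}w = E\left\{\left(w^{T}g\right)^{2}\right\},$

and, since the square is a convex function, Jensen's inequality gives $\left(E\{w^{T}g\}\right)^{2} \leq E\left\{\left(w^{T}g\right)^{2}\right\}$, which is the stated result.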

1. An optical flow estimation method comprising: obtaining encoded image data representative of an image sequence of a changing object having a motion field; extracting from said encoded image data first frame data blocks not incorporating motion vector encoding; extracting from said encoded image data second frame data blocks incorporating motion vector encoding; determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and updating the confidence map data on the basis of the smooth motion field data blocks to provide output data indicative of the optical flow of the image.

2. The method according to claim 1, wherein the encoded image data is MPEG-2 encoded video data.

3. The method according to claim 1, wherein the first frame data blocks are representative of luminance data of said encoded image data.

4. The method according to claim 3, wherein the first frame data blocks extracted from said encoded image data are representative of a discrete cosine transform (DCT) of the luminance data.

5. The method according to claim 4, wherein the confidence map data is determined from weighted AC coefficients of the discrete cosine transform (DCT) representative of the intensity gradients in mutually transverse directions.

6. The method according to claim 2, wherein the confidence map data is determined from the weighted AC[1] and AC[8] coefficients of the MPEG-2 encoded video data representative of the intensity gradients in mutually transverse directions.

7. The method according to claim 5, wherein the confidence map data is determined from the sum of the squares of the weighted AC coefficients of the discrete cosine transform (DCT) representative of the intensity gradients in mutually transverse directions.

8. The method according to claim 1, wherein the smooth motion field data blocks are derived from said second frame data blocks by a transformation in which, where a second frame data block has no motion vector, the corresponding field data block is ascribed the same motion vector as the immediately preceding field data block.

9. The method according to claim 1, wherein the smooth motion field data blocks are derived from said second frame data blocks by a transformation in which, where a second frame data block has two motion vectors pointing in opposite directions, the corresponding field data block is ascribed a motion vector in one of the directions having a magnitude corresponding to the sum of the magnitudes of said two motion vectors pointing in opposite directions.

10. The method according to claim 1, wherein the smooth motion field data blocks are derived from said second frame data blocks using spatial filtering to suppress isolated smooth motion field data blocks having a low probability of reflecting real movement.

11. The method according to claim 1, wherein the confidence map data is updated when the vector of a smooth motion field data block has a magnitude exceeding a certain threshold.

12. An optical flow estimation system utilising encoded image data representative of an image sequence of a changing object having a motion field, the system comprising: first extraction means for extracting from said encoded image data first frame data blocks not incorporating motion vector encoding; second extraction means for extracting from said encoded image data second frame data blocks incorporating motion vector encoding; determination means for determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; derivation means for deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and updating means for updating said confidence map data on the basis of said smooth motion field data blocks to provide output data indicative of the optical flow of the image.

13. The system according to claim 12, further adapted to receive MPEG-2 encoded video data.

14. The system according to claim 12, wherein the first frame data blocks are representative of luminance data of said encoded image data.

15. The system according to claim 14, wherein the first extraction means is arranged to extract the first frame data blocks such that the first frame data blocks are representative of a discrete cosine transform (DCT) of the luminance data.

16. The system according to claim 15, wherein the determination means is arranged to determine the confidence map data from weighted AC coefficients of the discrete cosine transform (DCT) representative of the intensity gradients in mutually transverse directions.

17. The system according to claim 13, wherein the determination means is arranged to determine the confidence map data from the weighted AC[1] and AC[8] coefficients of the MPEG-2 encoded video data representative of the intensity gradients in mutually transverse directions.

18. The system according to claim 16, wherein the determination means is arranged to determine the confidence map data from the sum of the squares of the weighted AC coefficients of the discrete cosine transform (DCT) representative of the intensity gradients in mutually transverse directions.

19. The system according to claim 12, wherein the derivation means is arranged to derive the smooth motion field data blocks from said second frame data blocks by a transformation in which, where a second frame data block has no motion vector, the corresponding field data block is ascribed the same motion vector as the immediately preceding field data block.

20. The system according to claim 12, wherein the derivation means is arranged to derive the smooth motion field data blocks from said second frame data blocks by a transformation in which, where a second frame data block has two motion vectors pointing in opposite directions, the corresponding field data block is ascribed a motion vector in one of the directions having a magnitude corresponding to the sum of the magnitudes of said two motion vectors pointing in opposite directions.

21. The system according to claim 12, wherein the derivation means is arranged to derive the smooth motion field data blocks from said second frame data blocks using spatial filtering to suppress isolated smooth motion field data blocks having a low probability of reflecting real movement.

22. The system according to claim 12, incorporating a digital processor.

23. A tangible computer readable storage medium having computer readable instructions stored thereon, the instructions comprising: instructions for extracting, from encoded image data representative of an image sequence of a changing object having a motion field, first frame data blocks not incorporating motion vector encoding; instructions for extracting from said encoded image data second frame data blocks incorporating motion vector encoding; instructions for determining from said first frame data blocks confidence map data indicative of the edge strength within said encoded image data and hence the accuracy of the motion field; instructions for deriving from said second frame data blocks smooth motion field data blocks in which each data block has a single motion vector and the magnitudes of the motion vectors are normalised; and instructions for updating the confidence map data on the basis of the smooth motion field data blocks to provide output data indicative of the optical flow of the image.